Rapid advances in next-generation sequencing technologies have allowed whole
genome sequencing of many species. However, with current sequencing technologies,
whole genome sequence assemblies often fall short in one of four quality
measurements: accuracy, contiguity, connectivity, and completeness. In particular,
small contigs and scaffolds limit the applicability of whole genome sequences for
genetic analysis. To enhance the quality of whole genome sequence assemblies,
particularly their scaffolding, additional genomic resources are required. Among
these, sequences derived from known physical locations offer great power for
scaffolding. In this mini-review, we describe the principles, procedures, and
applications of physical map-derived sequences, with a focus on physical map
contig-specific sequences.

Keywords: physical map contig-specific sequences, BAC end sequences, whole genome sequencing, assembly,

INTRODUCTION

Whole genome sequencing is the most robust approach to unrav- 
eling the genetic information of an organism. During the past 
several years, sequencing costs have declined drastically through 
the use of next-generation sequencing technologies, which include 
a suite of sequencing platforms, such as Illumina sequencing, 
SOLiD sequencing, and PacBio sequencing, among quite a few 
others. As a result, more and more species of biological or economic
importance have been added to the list of organisms whose whole
genomes have been sequenced.

However, each of the next-generation sequencing platforms has 
distinctive shortcomings. For instance, Illumina sequencing and
SOLiD sequencing generate accurate but short reads that are difficult
to assemble (Nagarajan et al., 2010; Luo et al., 2012).
PacBio sequencing generates long sequences, but its error rate 
is relatively high. These intrinsic shortcomings affect the assem- 
bly qualities of the whole genome sequences, which in turn limit 
the applicability of these whole genome sequences for genetic 
analysis. 

The assembly qualities of whole genome sequences are mea- 
sured by a number of parameters including the following: (1) 
accuracy; (2) completeness; (3) contiguity; and (4) connectiv- 
ity. Accuracy refers to the correctness of the sequences. It is an 
important metric as the miscalling of bases can cause substan- 
tial trouble for downstream operations, such as the identification 
of single nucleotide polymorphisms (SNPs). Sequencing accuracy 
is primarily intrinsic to the sequencing technology but can also 
be affected by the quality and quantity of the template DNA. In 
this regard, the Illumina and SOLiD sequencing platforms provide
fairly high-quality sequences, while the base-calling accuracy of
the PacBio and Roche 454 sequencing platforms is relatively low.
Completeness refers to the percentage of the total bases of the
genome that are represented in the assembly of the whole genome
sequences. Completeness is important because analysis involving 
the genes in missing regions will be difficult. Contiguity refers 
to the lengths of contiguous sequences. Continuous sequences 
allow full-length gene sequences, including regulatory sequences, 
to be obtained from the genome sequences for subsequent analysis.
Short contigs pose greater challenges for the assembly of the
genome into scaffolds, particularly with regard to correct order
and orientation. Connectivity refers to the extent to which con- 
tigs are properly linked together and reflect their original genomic 
locations, sequential order, and orientation. For genetic analysis, 
connectivity is the most important metric. For instance, associ- 
ation analysis has the capability of revealing the significant SNPs 
associated with a specific trait. If the significant SNPs are located 
on genome sequences that are well connected at the chromo- 
somal scale, Manhattan plots can be constructed to determine 
the distribution of significant SNPs along the chromosome(s). 
The probabilities of the involved significant SNPs can be exam- 
ined to determine the location of the most significant SNP and 
how the linkage disequilibrium decays around that specific SNP, 
thereby determining the number and location of the quantitative trait
loci (QTLs) involved with the trait. In contrast, if the genome
assembly is highly fragmented, many significant SNPs remain on
isolated contigs or scaffolds, and it will be difficult to determine
the number and the location of the QTLs. Therefore, highly fragmented
genome assemblies are of limited use (Sierro et al., 2013).
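
Two of the four measurements above lend themselves to a quick numerical illustration. The sketch below computes completeness as defined in the text and, as one possible summary of contiguity, the N50 statistic. N50 is not named in this review; it is used here only as a commonly reported contiguity summary, and the contig lengths and genome size are toy values.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def completeness(assembled_bases, genome_size):
    """Percentage of the genome's bases represented in the assembly."""
    return 100.0 * assembled_bases / genome_size


# Toy example: six contigs from a hypothetical 400-bp genome.
contig_lengths = [100, 80, 50, 30, 20, 20]
print(n50(contig_lengths))
print(completeness(sum(contig_lengths), 400))
```

Neither number says anything about connectivity: two assemblies with identical N50 values can differ greatly in how well their contigs are ordered and oriented along chromosomes.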

In addition to the intrinsic characteristics of each sequencing
technology, the DNA templates used for sequencing can add further
complexity. Heterozygous diploid organisms with two sets of similar
chromosomes pose a challenge for assembly, because it is difficult
to distinguish allelic sequences from highly similar paralogous
loci (Hahn et al., 2014). The
largest challenge of whole genome sequence assembly most likely 
comes from the presence of a large number of repetitive elements. 
Short tandem repeats longer than 100 bp often cause termination
of the sequencing reaction, while longer, interspersed repeated
sequences prevent short reads from being assembled into
long contigs. Repetitive sequences in the genome, such as transposons,
fragment de novo assemblies, because the sequencing reads
are usually not long enough to span the entire series of repetitive
sequences plus any unique flanking sequences (Jiang et al., 2013).
Such challenges are greater when dealing with species with complex
genomes, such as teleost fish, which have gone through one or two
additional rounds of whole genome duplication (Meyer and Van de Peer,
2005; Steinke et al., 2006; Moghadam et al., 2011; Xu et al., 2011b).
Assembly is also particularly problematic for species with large
genomes. For example, the Norway spruce has a genome size of 20 Gb,
and only 25% of its genome is assembled into scaffolds longer than
10 Kb (Nystedt et al., 2013).

Several approaches are available for providing scaffolding
capabilities. These include the generation of mate-paired reads from
inserts of variable lengths (Boetzer et al., 2011; Gao et al., 2011;
Gritsenko et al., 2012; Williams et al., 2012; Hunt et al., 2014;
Kajitani et al., 2014; Zimin et al., 2014) and the use of transcript
sequences (Mortazavi et al., 2010). Mate-paired reads can be generated
by Illumina sequencing of libraries with various insert sizes, or from
Fosmid libraries (Williams et al., 2012) or bacterial artificial
chromosome (BAC) libraries (Xu et al., 2007; Liu et al., 2009). Although
extremely efficient, the use of paired reads alone normally cannot
reduce the number of scaffolds to several thousand, as can
be done with physical maps. Therefore, we have taken advantage
of the available catfish BAC-based physical maps (Xu et al., 2007)
and developed a method for generating BAC-based physical map
contig-specific sequences (Jiang et al., 2013). Such physical map
contig-specific sequences make it possible to associate all of the
genome sequence contigs/scaffolds belonging to a single
physical map contig, effectively reducing the overall number
of scaffolds of the genome sequences. Here we describe the
principles, procedures, and applications of physical map-derived
sequences.

BAC-BASED PHYSICAL MAPS 

A BAC-based physical map consists of contigs of overlapping BAC
clone DNA fragments. An acceptable BAC-based physical map
usually consists of several thousand contigs. Any gaps can be
attributed to missing segments of the genome or to highly repetitive
regions that cannot be properly assigned to specific contigs.
Physical maps therefore organize the entire genome into several
thousand contigs.

Early efforts in whole genome sequencing primarily relied on
BAC clones selected from physical maps using a minimal tiling
path (MTP; Mahairas et al., 1999; Siegel et al., 1999); the MTP
can be selected through a graph-theoretical approach
(Bozdag et al., 2013). This sequencing strategy is referred
to as the "clone-by-clone" whole genome sequencing strategy. With
this approach, BAC clones selected from the physical map using
an MTP are sequenced by random shotgun sequencing and
assembly (Lander et al., 2001). The clone-by-clone sequencing
strategy reduces the complexity of sequencing and assembly from
the genome scale to the scale of a single BAC clone, thus making it
easier to assemble the genome. This whole genome sequencing strategy,
which utilizes a BAC-based physical map, has been widely used in
eukaryotes, such as human (Lander et al., 2001), mouse (Waterston et al.,
2002), chicken (International Chicken Genome Sequencing Consortium
[ICGSC], 2004), zebrafish (Howe et al., 2013), medaka
(Kasahara et al., 2007), Tetraodon (Jaillon et al., 2004), Arabidopsis
(Arabidopsis Genome Initiative [AGI], 2000), and rice (International
Rice Genome Sequencing Project [IRGSP], 2005), among
many others. However, it is very expensive and labor-intensive,
especially for non-model species.
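
To make the idea of a minimal tiling path concrete, the following sketch picks a smallest set of clones that still covers one physical map contig. It is a simplified greedy interval cover, not the graph-theoretical method of Bozdag et al. (2013); the function name, the clone names, and the assumption that each clone has known start and end coordinates within the contig are all illustrative.

```python
def minimal_tiling_path(clones):
    """Greedy interval cover for one physical map contig.

    clones: list of (name, start, end) tuples giving each clone's
    estimated position within the contig (hypothetical coordinates).
    Returns the names of a smallest set of clones whose intervals
    cover the contig from the leftmost start to the rightmost end.
    """
    if not clones:
        return []
    ordered = sorted(clones, key=lambda c: c[1])
    target_end = max(c[2] for c in ordered)
    covered_to = ordered[0][1]
    path = []
    i = 0
    while covered_to < target_end:
        best = None
        # Among clones starting inside the covered region,
        # keep the one that extends coverage the furthest.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError("clones do not form a single contig (gap found)")
        path.append(best[0])
        covered_to = best[2]
    return path


clones = [
    ("b1", 0, 150), ("b2", 50, 180), ("b3", 120, 300),
    ("b4", 170, 320), ("b5", 290, 450), ("b6", 310, 460),
]
print(minimal_tiling_path(clones))
```

In practice, clone positions in a fingerprint-based map are estimates, so a real selection would typically also require a minimum overlap between neighboring clones; that refinement is omitted here.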

The availability of next-generation sequencing technologies
has led to greater efforts in the development of software packages
for the assembly of whole genome sequences. However,
bioinformatic approaches alone cannot resolve the problems caused by
repetitive sequences, especially in large genomes. As a result,
the whole genome sequences of many species have been assembled into
large numbers of contigs, reflecting lower assembly quality. Further
enhancement of these whole genome sequence assemblies is
needed to make them useful. Many scientists have
considered coupling traditional approaches with contemporary
bioinformatic approaches. As such, physical maps remain crucial
resources for improving genome assembly, especially
for large and complex genomes. For instance, to achieve the
assembly of the large barley genome (5.1 Gb), a new strategy
was developed that included the construction of a sequence-enriched
barley physical map (Mayer et al., 2012). Another important role
of physical maps in whole genome sequencing is to orient the
assembled contigs/scaffolds. In a pilot study of salmon genome
sequencing, a 1-Mb genomic region was sequenced using GS
FLX shotgun and long paired-end sequencing, resulting in 175
contigs assembled into four scaffolds, which were then verified
and oriented by using a BAC-based physical map and BAC end
sequences (BES; Quinn et al., 2008). In another genome sequencing
pilot study using catfish, a physical map and BES were
used to confirm and order the assembled genome contigs (Jiang
et al., 2011). Lewin et al. (2009) concluded that physical maps
are indispensable for the precision of genome assemblies, after
comparing the quality of genome assemblies produced with and without
the use of physical maps. Finally, physical maps are essential
for assessing the quality of whole genome sequence assemblies
(Li et al., 2009; Zhang et al., 2012; Xu et al., 2013; Kim et al.,
2014).

BAC END SEQUENCES 

Bacterial artificial chromosome end sequences (BESs) are genomic survey
sequences generated using BAC clones as templates with sequencing primers
from the BAC vector. They are important genome resources, and
the most useful BESs are mate-paired reads. BESs have
been generated from a large number of species (Budiman et al.,
2000; Yuan et al., 2000; Zhao et al., 2001; Larkin et al., 2003; Ren
et al., 2003; Messing et al., 2004; Xu et al., 2006, 2011a; Liu et al.,
2009).

The use of BESs in whole genome sequencing projects was first
proposed as a tool for the identification of MTPs (Goff et al., 2002;
Yu et al., 2002). With next-generation sequencing, BESs remain
helpful in the assembly and scaffolding process, in particular for
complex and repeat-rich genomes (Feuillet et al., 2011). This is
because BESs are paired reads from large inserts that span
distances of 100-200 Kb. For instance, the average insert size in
the catfish BAC library is 161 Kb (Wang et al., 2007). Mate-paired
BESs can be used to combine assembled genome scaffolds into
superscaffolds (Quinn et al., 2008; Jiang et al., 2011). Moreover,
because each BES is associated with a BAC clone, it can be related to
a physical map, thereby integrating genome sequence contigs/scaffolds
with physical maps.
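
The way mate-paired BESs combine scaffolds into superscaffolds can be sketched as a grouping problem: whenever the two ends of one BAC clone hit different scaffolds, those scaffolds belong to the same superscaffold. The code below is a minimal illustration using a union-find structure. The input format, the function name, and the clone and scaffold names are assumptions made for this example and do not reproduce the procedures of the cited studies.

```python
from collections import defaultdict


def build_superscaffolds(bes_hits):
    """Group scaffolds linked by the two end sequences of a BAC clone.

    bes_hits: dict mapping a BAC clone name to a pair
    (scaffold hit by one end, scaffold hit by the other end);
    an end without a hit is given as None.
    Returns a sorted list of sorted scaffold-name lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for end_a, end_b in bes_hits.values():
        for scaffold in (end_a, end_b):
            if scaffold is not None:
                find(scaffold)  # register the scaffold
        if end_a is not None and end_b is not None:
            root_a, root_b = find(end_a), find(end_b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for scaffold in parent:
        groups[find(scaffold)].add(scaffold)
    return sorted(sorted(group) for group in groups.values())


bes_hits = {
    "BAC001": ("scaf1", "scaf2"),
    "BAC002": ("scaf2", "scaf3"),
    "BAC003": ("scaf4", "scaf4"),
    "BAC004": ("scaf5", None),
}
print(build_superscaffolds(bes_hits))
```

A real pipeline would also check that the orientation of the two ends and the distance between them are consistent with the BAC insert size before accepting a link; this sketch records membership only.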

However, a study on catfish demonstrated that BESs are not as
powerful as expected as anchoring points for
linking genome contigs to physical maps, for two reasons (Jiang et al.,
2013): first, BESs are relatively short (Xu et al., 2006; Liu et al.,
2009); and second, the number of BESs is still limited because of
the high cost associated with generating BESs, and even when all
of the BAC clones are sequenced, only two end sequences can be
generated per BAC clone. Therefore, additional sequences that are
specific for the physical map contigs are needed to enhance the
anchoring ability of BAC-associated sequences.



PHYSICAL MAP CONTIG-SPECIFIC SEQUENCES 

Although BESs from physical maps can be used as sequence tags 
to anchor assembled genome sequence contigs to the BAC con- 
tigs of physical maps, they account for only 0.5-1% of all genome 
sequences. We have developed a simple strategy for the rapid gen- 
eration of extensive sequence tags from the distinct BAC contigs 
of physical maps, to allow the vast majority of assembled genome 
contigs to be anchored to physical map contigs at a relatively low
cost (Jiang et al., 2013).

The core principle of physical map contig-specific sequences
is to generate next-generation sequences carrying known tags specific
for each of the BAC contigs in a physical map. Briefly, the strategy
for generating physical map contig-specific sequences includes six
major steps (Figure 1A): (1) select and culture the BAC clones
from each physical map contig using an MTP; (2) extract the BAC
DNA, and pool the DNA representing the MTP of each BAC
contig from the physical map; (3) digest the DNA using two
4-bp restriction endonucleases with different recognition sites but
compatible overhangs; (4) individually ligate the specific barcoded
adaptors to the fragments generated from each BAC contig from
the physical map; (5) amplify the specific barcoded fragments
via PCR using barcoded PCR primers for the fragments generated
from each BAC contig from the physical map; and (6) sequence the
PCR-amplified fragments via next-generation sequencing. After
sequencing, the sequences can be decoded based on their barcodes
to assign them to specific BAC contigs on the physical
map.

FIGURE 1 | (A) A flow chart illustrating the physical map contig-specific
fragment preparation. The BAC clones from each physical map contig
were selected using a minimal tiling path, and then the BAC DNA
from each physical map contig was pooled for digestion with two
restriction endonucleases. In-house designed adaptors were then ligated
to the digestion products, followed by amplification using in-house
designed primers. All of the PCR products were pooled together and
sequenced (Jiang et al., 2013). (B) Decoding based on the adaptors and
primers. Each PCR product was attached to a physical map
contig-specific tag at both ends. The tag representing the physical map
contig ID contains five parts: the shield DNA; the specific barcoded
sequence in the primer (Pj); the common/complementary sequences
between the adaptor and the PCR primer; the specific barcoded
sequences in the adaptor (Ai); and the overhangs proximal to the
fragments for ligation (Jiang et al., 2013).

The main advantages of this strategy for generating
physical map contig-specific sequences are its simplicity and low
cost. A BAC-based physical map normally consists of several
thousand BAC contigs, which means that thousands of specific
barcodes would be required to differentiate the sequences of each
physical map contig. To reduce the total number of barcodes, a
two-dimensional tagging strategy was designed, in which there
are two separate sets of barcodes: one is attached to the adaptors
that are ligated to the restriction enzyme-digested fragments, and the
other is attached to the PCR primers used for the amplification of the
fragments.

This approach is highly efficient. For instance, we have generated
a large number of catfish physical map contig-specific
sequences (Jiang et al., 2013) with limited financial resources. The
catfish physical map contains 1,824 contigs. If only a single set of
barcodes is used, 1,824 specific tags are required. With
the two-dimensional tagging strategy, all 1,824 BAC DNA pools
were arrayed into a two-dimensional 38 (row) x 48 (column)
setup, using twenty 96-well plates, in which the rows represent
one set of tags for adaptors, Ai, where i = 1, 2, 3, ..., 38, and
the columns represent another set of tags, Pj, where j = 1, 2, 3,
..., 48. As such, each pool of PCR products represents the fragments
derived from a single physical map contig with Ai and
Pj at the ends. In this way, only 86 (38 + 48) barcodes are
needed, but their combinations (38 x 48) generate 1,824 distinct
barcode pairs. As shown in Figure 1B, each end of the amplified
fragments attached to the specific barcodes consists of five parts:
a common sequence that acts as the "shield" to keep the barcodes
intact, the specific barcoded sequence in the primer (Pj),
the common/complementary sequences between the adaptor and
PCR primer, the specific barcoded sequences in the adaptor (Ai),
and the overhangs proximal to the fragments to be ligated to the
restriction fragments.
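
The two-dimensional tagging arithmetic and the five-part tag layout can be sketched in a few lines. The 38 x 48 dimensions come from the text; everything else is an assumption made for illustration: the row-major assignment of pools to (Ai, Pj) pairs, the function names, and the toy shield, common, overhang, and barcode sequences, none of which are the sequences used by Jiang et al. (2013).

```python
N_ROWS, N_COLS = 38, 48  # adaptor tags A1..A38, primer tags P1..P48


def contig_to_tags(index):
    """Map a 0-based pool index (0..1823) to 1-based (i, j) for Ai, Pj.
    Row-major order is an assumption of this sketch."""
    if not 0 <= index < N_ROWS * N_COLS:
        raise ValueError("pool index out of range")
    row, col = divmod(index, N_COLS)
    return row + 1, col + 1


def tags_to_contig(i, j):
    """Inverse of contig_to_tags."""
    if not (1 <= i <= N_ROWS and 1 <= j <= N_COLS):
        raise ValueError("tag out of range")
    return (i - 1) * N_COLS + (j - 1)


def decode_read(read, shield, common, overhang, primer_tags, adaptor_tags):
    """Parse a read laid out as shield + Pj + common + Ai + overhang + insert.

    primer_tags / adaptor_tags map barcode sequence -> j / i; all barcodes
    within one set are assumed to have the same length.
    Returns (pool index, insert sequence), or None if the tag is not found.
    """
    p_len = len(next(iter(primer_tags)))
    a_len = len(next(iter(adaptor_tags)))
    if not read.startswith(shield):
        return None
    pos = len(shield)
    j = primer_tags.get(read[pos:pos + p_len])
    pos += p_len
    if j is None or not read.startswith(common, pos):
        return None
    pos += len(common)
    i = adaptor_tags.get(read[pos:pos + a_len])
    pos += a_len
    if i is None or not read.startswith(overhang, pos):
        return None
    return tags_to_contig(i, j), read[pos + len(overhang):]


# Toy sequences, purely hypothetical.
primer_tags = {"AAT": 1, "CCT": 2}
adaptor_tags = {"TTG": 1, "GCA": 2}
read = "GG" + "CCT" + "ACAC" + "GCA" + "CG" + "TTTTAAAA"
print(N_ROWS + N_COLS, N_ROWS * N_COLS)
print(decode_read(read, "GG", "ACAC", "CG", primer_tags, adaptor_tags))
```

The sketch requires exact barcode matches. A production decoder would normally tolerate sequencing errors in the barcodes, which is one reason barcode sets are usually designed to differ from each other at several positions.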

One of the most important applications of physical map contig-specific
sequences is to associate whole genome sequence contigs
into scaffolds. The sequence assemblies obtained from each BAC
contig in the physical map can be used to search the contigs in
the whole genome sequence using BLAST. When the sequences from
one physical map contig hit two or more contigs in the whole genome
sequence, those contigs are brought together, thereby reducing the
number of independent contigs in the whole genome sequence. When only
one contig from the whole genome sequence is hit, the hit still reveals
that this whole genome sequence contig is associated with a specific
physical map contig. Therefore, the likelihood of "scaffolding"
the contigs in the whole genome sequence is increased. For
instance, in our study using catfish physical map contig-specific
sequences, we compared the power of anchoring the genome
contigs by using BESs alone with that of using both BESs and physical
map contig-specific sequences (Jiang et al., 2013; Figure 2).
With the previously available BESs alone, 27,770 whole genome
sequence contigs (11% of the whole genome contigs; channel
catfish assembly version 1.0, unpublished) had significant hits
to the BESs. When the physical map contig-specific sequences
were also used, the number of whole genome contigs with significant
hits increased to 156,457. In terms of the total length of
the genome contigs being scaffolded, over 79% of the assembled
whole genome sequences were anchored when using both BESs
and the physical map contig-specific sequences, but only 26%
of the assembled whole genome sequences were anchored when
only BESs were used. To further assess the scaffolding capacity of
the physical map contig-specific sequences, we also determined
the number of genes that could be anchored to the scaffolds of
the whole genome sequences. The number of genes increased
drastically, from 6,732 when only BESs were used to 16,680 when
both BESs and the physical map contig-specific sequences were
used (Jiang et al., 2013). All of these results demonstrate the
strong anchoring capability of the physical map contig-specific
sequences. However, the order and orientation of the whole
genome sequence contigs within each physical map contig remain
largely unknown, unless the gaps can be filled by physical map
contig-specific sequences.
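
The anchoring step described above amounts to grouping whole genome sequence contigs by the physical map contig whose sequences hit them, and then summarizing how much of the assembly has been anchored. The sketch below assumes the significant BLAST hits have already been filtered and reduced to (physical map contig, genome contig) pairs. The function name, the input format, and the decision to set aside genome contigs hit by more than one physical map contig are assumptions of this example, not the published procedure.

```python
from collections import defaultdict


def anchor_contigs(hits, contig_lengths):
    """Group genome contigs by the physical map contig that hits them.

    hits: iterable of (physical_map_contig, wgs_contig) pairs, one per
    significant alignment.
    contig_lengths: dict mapping every wgs_contig to its length in bp.
    Returns (groups, ambiguous, percent of assembled bases anchored).
    """
    pm_by_wgs = defaultdict(set)
    for pm_contig, wgs_contig in hits:
        pm_by_wgs[wgs_contig].add(pm_contig)

    groups = defaultdict(set)
    ambiguous = set()
    for wgs_contig, pm_contigs in pm_by_wgs.items():
        if len(pm_contigs) == 1:
            groups[next(iter(pm_contigs))].add(wgs_contig)
        else:
            # Hit by several physical map contigs: left unassigned here.
            ambiguous.add(wgs_contig)

    anchored_bp = sum(contig_lengths[c] for g in groups.values() for c in g)
    total_bp = sum(contig_lengths.values())
    return dict(groups), ambiguous, 100.0 * anchored_bp / total_bp


hits = [("pm1", "c1"), ("pm1", "c2"), ("pm2", "c3"),
        ("pm1", "c4"), ("pm2", "c4")]
lengths = {"c1": 1000, "c2": 2000, "c3": 3000, "c4": 1500, "c5": 2500}
groups, ambiguous, percent = anchor_contigs(hits, lengths)
print(groups, ambiguous, percent)
```

As the text notes, membership in a group does not by itself give the order or orientation of the genome contigs within the physical map contig.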

CONCLUSION 

Next-generation sequencing technologies have provided unprecedented
possibilities for genome sequencing. However, challenges
remain in generating well-assembled reference genomes due to
the short reads produced by the next-generation sequencing
platforms and to the complexities of large eukaryotic genomes
with high levels of repetitive elements. For genetic analysis, the
anchoring of whole genome sequence contigs and scaffolds to
chromosomes is perhaps the most important goal. Among the
many different approaches for anchoring whole genome sequences
to chromosomes, BESs and physical map contig-specific sequences
provide great power for linking whole genome shotgun sequence
contigs to physical maps. This significantly reduces the workload
when genetic linkage mapping is used to anchor whole genome
sequence contigs to chromosomes through the integration of
genetic linkage and physical maps. The generation of physical map
contig-specific sequences is both technologically simple and
cost-effective.